| SOCR ≫ | DSPA ≫ | Topics ≫ |
Deep learning is a special branch of machine learning using a collage of algorithms to model high-level motifs in data. Deep learning resembles the biological communications between brain neurons in the central nervous system (CNS), where synthetic graphs represent the CNS network as nodes/states and connections/edges between them. For instance, in a simple synthetic network consisting of a pair of connected nodes, an output send by one node is received by the other as an input signal. When more nodes are present in the network, they may be arranged in multiple levels (like a multiscale object) where the \(i^{th}\) layer output serves as the input of the next \((i+1)^{st}\) layer. The signal is manipulated at each layer and sent as a layer output downstream and interpreted as an input to the next, \((i+1)^{st}\) layer, and so forth. Deep learning relies on multiple layers of nodes and many edges linking the nodes forming input/output (I/O) layered grids representing a multiscale processing network. At each layer, linear and non-linear transformations are converting inputs into outputs.
In this chapter, we explore the R deep learning package MXNetR and demonstrate state-of-the-art deep learning models utilizing CPU and GPU for fast training (learning) and testing (validation). Other powerful deep learning frameworks include TensorFlow, Theano, Caffe, Torch, CNTK and Keras.
Neural Networks vs. Deep Learning: Deep Learning is a machine learning strategy that learns a deep multi-level hierarchical representation of the affinities and motifs in the dataset. Machine learning Neural Nets tend to use shallower network models. Although there are no formal restrictions on the depth of the layers in a Neural Net, few layers are commonly utilized. Recent methodological, algorithmic, computational, infrastructure and service advances overcome previous limitations. In addition, the rise of Big Data accelerated the evolution of classical Neural Nets to Deep Neural Nets, which can now handle lots of layers and many hidden nodes per layer. The former is a precursor to the latter, however, there are also non-neural deep learning techniques. For example, syntactic pattern recognition methods and grammar induction discover hierarchies.
Review Chapter 10 (Black Box Machine-Learning Methods: Neural Networks and Support Vector Machines) prior to proceeding.
A perceptron is an artificial analogue of a neuronal brain cell that calculates a weighted sum of the input values and outputs a thresholded version of that result. For a bivariate perceptron, \(P\), having two inputs, \((X,Y)\), denote the weights of the inputs by \(A\) and \(B\), respectively. Then, the weighted sum could be represented as: \[W = A X + B Y.\]
At each layer \(l\), the weight matrix, \(W^{(l)}\), has the following properties:
Neuronal cells fire depending on the presynaptic inputs to the cell which causes constant fluctuations of the neuronal the membrane - depolarizing or hyperpolarizing, i.e., the cell membrane potential rises or falls. Similarly, perceptrons rely on thresholding of the weight-averaged input signal, which for biological cells corresponds to voltage increases passing a critical threshold. Perceptrons output non-zero values only when the weighted sum exceeds a certain threshold \(C\). In terms of its input vector, \((X,Y)\), we can describe the output of each perceptron (\(P\)) by:
\[ Output(P) = \left\{ \begin{array} {} 1, & if A X + B Y > C \\ 0, & if A X + B Y \leq C \end{array} \right. .\]
Feed-forward networks are constructed as layers of perceptrons where the first layer ingests the inputs and the last layer generates the network outputs. The intermediate (internal) layers are not directly connected to the external world, and are called hidden layers. In fully connected networks, each perceptron in one layer is connected to every perceptron on the next layer enabling information “fed forward” from one layer to the next. There are no connections between perceptrons in the same layer.
Multilayer perceptrons (fully-connected feed-forward neural network) consist of several fully-connected layers representing an input matrix \(X_{n \times m}\) and a generated output matrix \(Y_{n \times k}\). The input \(X_{n,m}\) is a matrix encoding the \(n\) cases and \(m\) features per case. The weight matrix \(W_{m,k}^{(l)}\) for layer \(l\) has rows (\(i\)) corresponding to the weights leading from all the units \(i\) in the previous layer to all of the units \(j\) in the current layer. The product matrix \(X\times W\) has dimensions \(n\times k\).
The hidden size parameter \(k\), the weight matrix \(W_{m \times k}\), and the bias vector \(b_{n \times 1}\) are used to compute the outputs at each layer:
\[Y_{n \times k}^{(l)} =f_k^{(l)}\left ( X_{n \times m} W_{m \times k}+b_{k \times 1} \right ).\]
The role of the bias parameter is similar to the intercept term in linear regression and helps improve the accuracy of prediction by shifting the decision boundary along \(Y\) axis. The outputs are fully-connected layers that feed into an activation layer to perform element-wise operations. Examples of activation functions that transform real numbers to probability-like values include:
relu, Rectified Linear Unit) function, which outputs the \(\max(0, input)\),The final fully-connected layer may be hidden of size equal to the number of classes in the dataset and may be followed by a softmax layer mapping the input into a probability score. For example, if a size \({n\times m}\) input is denoted by \(X_{n\times m}\), then the probability scores may be obtained by the softmax transformation function, which maps real valued vectors to vectors of probabilities:
\[\left ( \frac{e^{x_{i,1}}}{\displaystyle\sum_{j=1}^m e^{x_{i,j}}},\ldots, \frac{e^{x_{i,m}}}{\displaystyle\sum_{j=1}^m e^{x_{i,j}}}\right ).\]
Below is a schematic of fully-connected feed-forward neural network of nodes \[ \{ a_{j=node\ index, l=layer\ index} \}_{j=1, l=1}^{n_j, 4}.\]
The plot above illustrates the key elements in the action potential, or activation function, calculations and the corresponding training parameters: \[a_{node=k,layer=l} = f\left ( \displaystyle\sum_{i}{w_{k,i}^l \times a_{i}^{l-1} +b_{k}^l} \right ),\] where:
These parameters may be estimated using different techniques (e.g., using least squares, or stochastically using steepest decent methods) based on the training data.
There are parallels between biology (neuronal cells) and the mathematical models (perceptrons) for neural network representation. The human brain contain about \(10^{11}\) neuronal cells connected by approximately \(10^{15}\) synapses forming the basis of our functional phenotypes. The schematic below illustrates some of the parallels between brain biology and the mathematical representation using synthetic neural nets. Every neuronal cell receives multi-channel (afferent) input from its dendrites, generates output signals and disseminates the results via its (efferent) axonal connections and synaptic connections to dendrites of other neurons.
The perceptron is a mathematical model of a neuronal cell that allows us to explicitly determine algorithmic and computational protocols transforming input signals into output actions. For instance, signal arriving through an axon \(x_0\) is modulated by some prior weight, e.g., synaptic strength, \(w_0\times x_0\). Internally, within the neuronal cell, this input is aggregated (summed, or weight-averaged) with inputs from all other axons. Brain plasticity suggests that synaptic strengths (weight coefficients \(w\)) are strengthened by training and prior experience. This learning process controls the direction and strength of influence of neurons on other neurons. Either excitatory (\(w>0\)) or inhibitory (\(w\leq 0\)) influences are possible. Dendrites and axons carry signals to and from neurons, where the aggregate responses are computed and transmitted downstream. Neuronal cells only fire if action potentials exceed a certain threshold. In this situation, a signal is transmitted downstream through its axons. The neuron remains silent, if the summed signal is below the critical threshold.
Timing of events is important in biological networks. In the computational perceptron model, a first order approximation may ignore the timing of neuronal firing (spike events) and only focus on the frequency of the firing. The firing rate of a neuron with an activation function \(f\) represents the frequency of the spikes along the axon. We saw some examples of activations functions earlier.
The diagram below illustrates the parallels between the brain network-synaptic organization and an artificial synthetic neural network.
Before we look at some examples of deep learning algorithms applied to model observed natural phenomena, we will develop a couple of simple networks for computing fundamental Boolean operations.
The exclusive OR (XOR) operator works as a bivariate binary-outcome function, mapping pairs of false (0) and true (1) values to dichotomous false (0) and true (1) outcomes.
We can design a simple two-layer neural network that calculates XOR. The values within each neurons represent its explicit threshold, which can be normalized so that all neurons utilize the same threshold, typically \(1\)). The value labels associated with network connections (edges) represent the weights of the inputs. When the threshold is not reached, output is \(0\), and then the threshold is reached, the output is correspondingly \(1\).
Let’s work out manually the 4 possibilities:
| InputX | InputY | XOR Output(Z) |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
We can validate that this network indeed represents an XOR operator by plugging in all four possible input combinations and confirming the expected results at the end.
Another binary operator is NAND (negative AND, Sheffer stroke) that produces a false (0) output if and only if both of its operands are true (1), and generates true (1), otherwise. Below is the NAND input-output table.
| InputX | InputY | NAND Output(Z) |
|---|---|---|
| 0 | 0 | 1 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
Similarly to the XOR operator, we can design a one-layer neural network that calculates NAND. The values within each neurons represent its explicit threshold, which can be normalized so that all neurons utilize the same threshold, typically \(1\)). The value labels associated with network connections (edges) represent the weights of the inputs. When the threshold is not reached, the output is trivial (\(0\)) and when the threshold is reached, the output is correspondingly \(1\). Here is a shorthand analytic expression for the NAND calculation:
\[NAND(X,Y) = 1.3 - (1\times X + 1\times Y).\] Check that \(NAND(X,Y)=0\) if and only if \(X=1\) and \(Y=1\), otherwise it equals \(1\).
Observe that stringing some of these primitive networks together, or/and increasing the number of hidden layers, allows us to model problems with exponentially increasing complexity. For instance, constructing a 4-input NAND function would simply require repeating several of our 2-input NAND operators. This will increase the space of possible outcomes from \(2^2\) to \(2^4\). Of course, introducing more depth in the hidden layers further expands the complexity of the problems that can be modeled using neural nets.
You can interactively manipulate the Google’s TensorFlow Deep Neural Network Webapp to gain additional intuition and experience with the various components of deep learning networks.
The ConvnetJS demo provide another hands-on example using 2D classification with 2-layer neural network.
KerasThere are many different neural-net and deep-learning frameworks. The table below summarizes some of the main deep learning R packages.
| Package | Description |
|---|---|
| nnet | Feed-forward neural networks using 1 hidden layer |
| neuralnet | Training backpropagation neural networks |
| tensorflow | Google TensorFlow used in TensorBoard (see SOCR UKBB Demo) |
| deepnet | Deep learning toolkit |
| darch | Deep Architectures based on Restricted Boltzmann Machines |
| rnn | Recurrent Neural Networks (RRNs) |
| rcppDL | Multi-layer machine learning methods including dA (Denoising Autoencoder), SdA (Stacked Denoising Autoencoder), RBM (Restricted Boltzmann machine), and DBN (Deep Belief Nets) |
| deepr | DL training, fine-tuning and predicting processes using darch and deepnet |
| MXNetR | Flexible and efficient ML/DL tools utilizing CPU/GPU computing |
| kerasR | RStudio’s keras DL implementation wrapping C++/Python executable libraries |
| Keras | Python based neural networks API, connecting Google TensorFlow, Microsoft Cognitive Toolkit (CNTK), and Theano |
Most DL/ML R packages provide interfaces (APIs) to libraries that are build using foundational languages like C/C++ and Java. Most of the Python libraries also act as APIs to lower-level executables compiles for specific platforms (Mac, Linux, PC).
The keras package uses the pipe operator (%>%) to join multiple functions or operators, which streamlines the readability of the script protocol.
The kerasR package contains analogous functions to the ones in keras and utilizes the \(\$\) operator to create models. There are parallels between the core Python methods and their keras counterparts: compile() and keras_compile(), fit() and keras_fit(), predict() and keras_predict().
Below we will demonstrate the utilization of the Keras package for deep neural network analytics. This will require installation of keras and TensorFlow via R devtools::install_github("rstudio/keras").
# devtools::install_github("rstudio/keras")
library("keras")
# install_keras()
#install.packages("tensorflow")
library(tensorflow)
#install_tensorflow()The Keras package includes built-in datasets with load() functions, e.g., mnist.load_data() and imdb.load_data().
mnist <- dataset_mnist()
imdb <- dataset_imdb()Instead of using the default data provided in the keras package, we will utilize one of the datasets on the DSPA Case-Studies Website, can also be loaded much like we have done before. Below, we download the Titanic Passenger Dataset.
library(reshape)
library(caret)
dat <- read.csv("https://umich.instructure.com/files/9372716/download?download_frd=1")
# Inspect for missing values (empty or NA):
dat.miss <- melt(apply(dat[, -2], 2, function(x) sum(is.na(x) | x=="")))
cbind(row.names(dat.miss)[dat.miss$value>0], dat.miss[dat.miss$value>0,])## [,1] [,2]
## [1,] "age" "263"
## [2,] "fare" "1"
## [3,] "cabin" "1014"
## [4,] "embarked" "2"
# We can exclude the "Cabin" feature which includes 80% missing values.
# Impute the few missing Embarked values using the most common value (S)
table(dat$embarked)##
## C Q S
## 2 270 123 914
dat$embarked[which(is.na(dat$embarked) | dat$embarked=="")] <- "S"
# Some "fare"" values may represent total cost of group purchases
# We can derive a new variable "price" representing fare per person
# Update missing fare value with 0
dat$fare[which(is.na(dat$fare))] <- 0
# calculate ticket Price (Fare per person)
ticket.count <- aggregate(dat$ticket, by=list(dat$ticket), function(x) sum( !is.na(x) ))
dat$price <- apply(dat, 1, function(x) as.numeric(x["fare"]) /
ticket.count[which(ticket.count[, 1] == x["ticket"]), 2])
# Impute missing prices (price=0) using the median price per passenger class
pclass.price<-aggregate(dat$price, by = list(dat$pclass), FUN = function(x) median(x, na.rm = T))
dat[which(dat$price==0), "price"] <-
apply(dat[which(dat$price==0), ] , 1, function(x)
pclass.price[pclass.price[, 1]==x["pclass"], 2])
# Define a new variable "ticketcount" coding the number of passengers sharing the same ticket number
dat$ticketcount <-
apply(dat, 1, function(x) ticket.count[which(ticket.count[, 1] ==
x["ticket"]), 2])
# Capture the passenger title
dat$title <-
regmatches(as.character(dat$name),
regexpr("\\,[A-z ]{1,20}\\.", as.character(dat$name)))
dat$title <-
unlist(lapply(dat$title,
FUN=function(x) substr(x, 3, nchar(x)-1)))
table(dat$title)##
## Capt Col Don Dona Dr
## 1 4 1 1 8
## Jonkheer Lady Major Master Miss
## 1 1 2 61 260
## Mlle Mme Mr Mrs Ms
## 2 1 757 197 2
## Rev Sir the Countess
## 8 1 1
# Bin the 17 alternative title groups into 4 common 4 titles (factors)
dat$title[which(dat$title %in% c("Mme", "Mlle"))] <- "Miss"
dat$title[which(dat$title %in%
c("Lady", "Ms", "the Countess", "Dona"))] <- "Mrs"
dat$title[which(dat$title=="Dr" & dat$sex=="female")] <- "Mrs"
dat$title[which(dat$title=="Dr" & dat$sex=="male")] <- "Mr"
dat$title[which(dat$title %in% c("Capt", "Col", "Don",
"Jonkheer", "Major", "Rev", "Sir"))] <- "Mr"
dat$title <- as.factor(dat$title)
table(dat$title)##
## Master Miss Mr Mrs
## 61 263 782 203
# Impute missing ages using median age for each title group
title.age <- aggregate(dat$age, by = list(dat$title),
FUN = function(x) median(x, na.rm = T))
dat[is.na(dat$age), "age"] <- apply(dat[is.na(dat$age), ] , 1,
function(x) title.age[title.age[, 1]==x["title"], 2])We can start by some simple EDA plots, reporting some numerical summaries, examining pairwise correlations, and showing the distributions of some features in this dataset.
library(ggplot2)
summary(dat)## pclass survived name
## Min. :1.000 Min. :0.000 Connolly, Miss. Kate : 2
## 1st Qu.:2.000 1st Qu.:0.000 Kelly, Mr. James : 2
## Median :3.000 Median :0.000 Abbing, Mr. Anthony : 1
## Mean :2.295 Mean :0.382 Abbott, Master. Eugene Joseph : 1
## 3rd Qu.:3.000 3rd Qu.:1.000 Abbott, Mr. Rossmore Edward : 1
## Max. :3.000 Max. :1.000 Abbott, Mrs. Stanton (Rosa Hunt): 1
## (Other) :1301
## sex age sibsp parch
## female:466 Min. : 0.1667 Min. :0.0000 Min. :0.000
## male :843 1st Qu.:22.0000 1st Qu.:0.0000 1st Qu.:0.000
## Median :30.0000 Median :0.0000 Median :0.000
## Mean :29.5689 Mean :0.4989 Mean :0.385
## 3rd Qu.:36.0000 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :80.0000 Max. :8.0000 Max. :9.000
##
## ticket fare cabin embarked
## CA. 2343: 11 Min. : 0.000 :1014 : 0
## 1601 : 8 1st Qu.: 7.896 C23 C25 C27 : 6 C:270
## CA 2144 : 8 Median : 14.454 B57 B59 B63 B66: 5 Q:123
## 3101295 : 7 Mean : 33.270 G6 : 5 S:916
## 347077 : 7 3rd Qu.: 31.275 B96 B98 : 4
## 347082 : 7 Max. :512.329 C22 C26 : 4
## (Other) :1261 (Other) : 271
## price ticketcount title
## Min. : 3.171 Min. : 1.000 Master: 61
## 1st Qu.: 7.729 1st Qu.: 1.000 Miss :263
## Median : 8.158 Median : 1.000 Mr :782
## Mean : 14.991 Mean : 2.102 Mrs :203
## 3rd Qu.: 15.050 3rd Qu.: 3.000
## Max. :128.082 Max. :11.000
##
cols <- c("red","green")[unclass(dat$survived)]
plot(dat$ticketcount, dat$fare, pch=21, cex=1.5,
bg=alpha(cols, 0.4),
xlab="Number of Tickets per Party", ylab="Passenger Fare",
main="Titanic Passenger Data (TicketCount vs. Fare) Color Coded by Survival")
legend("topright", inset=.02, title="Survival",
c("0","1"), fill=c("red", "green"), horiz=F, cex=0.8)ggplot(dat, aes(x=survived, y=fare, fill=sex)) +
#geom_dotplot(binaxis='y', stackdir='center',
# position=position_dodge(1)) +
#scale_fill_manual(values=c("#999999", "#E69F00")) +
geom_violin(trim=FALSE) +
theme(legend.position="top")library(GGally)
ggpairs(dat[ , c("pclass", "age", "sibsp", "parch",
"fare", "price", "ticketcount", "survived")],
aes(colour = as.factor(survived), alpha = 0.4))Before we go into modeling the data, we need to preprocess it, e.g., normalize the numerical values and split it into training and testing sets.
dat1 <- dat[ , c("pclass", "age", "sibsp", "parch", "fare",
"price", "ticketcount", "survived")]
dat1$pclass <- as.factor(dat1$pclass)
dat1$age <- as.numeric(dat1$age)
dat1$sibsp <- as.factor(dat1$sibsp)
dat1$parch <- as.factor(dat1$parch)
dat1$fare <- as.numeric(dat1$fare)
dat1$price <- as.numeric(dat1$price)
dat1$ticketcount <- as.numeric(dat1$ticketcount)
dat1$survived <- as.factor(dat1$survived)
# Set the `dimnames` to `NULL`
# dimnames(dat1) <- NULL
dim(dat1)## [1] 1309 8
Use keras::normalize() to normalize the numerical data.
# devtools::install_github("rstudio/keras")
# First install Anaconda/Python: https://www.anaconda.com/download/#windows
# install_keras()
# reticulate::py_config()
# library("keras")
# use_python("C:/Users/Dinov/AppData/Local/Programs/Python/Python37/python.exe")
# install_keras()
# install_keras(method = "conda")
library("keras")
# Normalize the data
summary(dat1[ , c(2,5,6,7)])## age fare price ticketcount
## Min. : 0.1667 Min. : 0.000 Min. : 3.171 Min. : 1.000
## 1st Qu.:22.0000 1st Qu.: 7.896 1st Qu.: 7.729 1st Qu.: 1.000
## Median :30.0000 Median : 14.454 Median : 8.158 Median : 1.000
## Mean :29.5689 Mean : 33.270 Mean : 14.991 Mean : 2.102
## 3rd Qu.:36.0000 3rd Qu.: 31.275 3rd Qu.: 15.050 3rd Qu.: 3.000
## Max. :80.0000 Max. :512.329 Max. :128.082 Max. :11.000
dat2 <- dat1[ , c(2,5,6,7)]
dat2 <- as.matrix(dat2)
dimnames(dat2) <- NULL
# May be best to avoid normalizing the ordinal variable "ticketcount"
dat2.norm <- keras::normalize(dat2)
# report the summary`
summary(dat2.norm)## V1 V2 V3 V4
## Min. :0.005962 Min. :0.0000 Min. :0.0797 Min. :0.007529
## 1st Qu.:0.529205 1st Qu.:0.2891 1st Qu.:0.2265 1st Qu.:0.026743
## Median :0.823905 Median :0.4298 Median :0.2845 Median :0.034269
## Mean :0.700518 Mean :0.5105 Mean :0.2961 Mean :0.049940
## 3rd Qu.:0.913000 3rd Qu.:0.7543 3rd Qu.:0.3640 3rd Qu.:0.050859
## Max. :0.991762 Max. :0.9884 Max. :0.7033 Max. :0.246830
colnames(dat2.norm) <- c("age", "fare", "price", "ticketcount")Next, we’ll partition the raw data into training (80%) and testing (20%) sets that will be utilized to build the forecasting model (to predict Titanic passenger survival) and assess the model performance, respectively.
train_set_ind <- sample(nrow(dat2.norm), floor(nrow(dat2.norm)*0.8)) # 80:20 plot training:testing
train_dat2.X <- dat2.norm[train_set_ind, ]
train_dat2.Y <- dat1[train_set_ind , 8] # Outcome "survived" column:8
test_dat2.X <- dat2.norm[-train_set_ind, ]
test_dat2.Y <- dat1[-train_set_ind , 8] # Outcome "survived" column:8
# double check the size/dimensions of the training and testing data (predictors and responses)
dim(train_dat2.X); length(train_dat2.Y); dim(test_dat2.X); length(test_dat2.Y)## [1] 1047 4
## [1] 1047
## [1] 262 4
## [1] 262
For multi-class classification problems via NN modeling, the keras::to_categorical() function allows us to transform the outcome attribute from a vector of class labels to a matrix of Boolean features, one for each class label. In this case, we have a bivariate (binary classification), passenger survival indicator.
Keras modeling starts with first initializing a sequential model using the keras::keras_model_sequential() function.
We will try to predict the passenger survival using a fully-connected multi-layer perceptron NN. We will need to choose an activation function, e.g., relu, sigmoid. A rectifier activation function (relu) may be used in a hidden layer and a softmax activation function may be used in the final output layer so that the outputs represent (posterior) probabilities between 0 and 1, corresponding to the odds of survival. In the first layer, we can specify 8 hidden nodes (units), an input_shape of 4, to reflect the 4 features in the training data age, fare, price, ticketcount, and the output layer with 2 output values, one for each of the survival categories. We can also inspect the structure of the NN model using:
summary(): print a summary representation of your model,get_config(): return a list that contains the configuration of the model,get_layer(): return the layer configuration,$layers: NN model attribute retrieves a flattened list of the model’s layers,$inputs: NN model attribute listing the input tensors,$outputs: NN model attribute retrieve the output tensors.model.1 <- keras_model_sequential()
# Add layers to the model
model.1 %>%
layer_dense(units = 8, activation = 'relu', input_shape = c(4)) %>%
layer_dense(units = 2, activation = 'softmax')
# NN model summary
summary(model.1)## ___________________________________________________________________________
## Layer (type) Output Shape Param #
## ===========================================================================
## dense_1 (Dense) (None, 8) 40
## ___________________________________________________________________________
## dense_2 (Dense) (None, 2) 18
## ===========================================================================
## Total params: 58
## Trainable params: 58
## Non-trainable params: 0
## ___________________________________________________________________________
# Report model configuration
get_config(model.1)## {'name': 'sequential_1', 'layers': [{'class_name': 'Dense', 'config': {'name': 'dense_1', 'trainable': True, 'batch_input_shape': (None, 4), 'dtype': 'float32', 'units': 8, 'activation': 'relu', 'use_bias': True, 'kernel_initializer': {'class_name': 'VarianceScaling', 'config': {'scale': 1.0, 'mode': 'fan_avg', 'distribution': 'uniform', 'seed': None}}, 'bias_initializer': {'class_name': 'Zeros', 'config': {}}, 'kernel_regularizer': None, 'bias_regularizer': None, 'activity_regularizer': None, 'kernel_constraint': None, 'bias_constraint': None}}, {'class_name': 'Dense', 'config': {'name': 'dense_2', 'trainable': True, 'units': 2, 'activation': 'softmax', 'use_bias': True, 'kernel_initializer': {'class_name': 'VarianceScaling', 'config': {'scale': 1.0, 'mode': 'fan_avg', 'distribution': 'uniform', 'seed': None}}, 'bias_initializer': {'class_name': 'Zeros', 'config': {}}, 'kernel_regularizer': None, 'bias_regularizer': None, 'activity_regularizer': None, 'kernel_constraint': None, 'bias_constraint': None}}]}
# report layer configuration
get_layer(model.1, index = 1)## <keras.layers.core.Dense>
# Report model layers
model.1$layers## [[1]]
## <keras.layers.core.Dense>
##
## [[2]]
## <keras.layers.core.Dense>
# List the input tensors
model.1$inputs## [[1]]
## Tensor("dense_1_input:0", shape=(?, 4), dtype=float32)
# List the output tensors
model.1$outputs## [[1]]
## Tensor("dense_2/Softmax:0", shape=(?, 2), dtype=float32)
Once the model architecture is specified, we need to estimate (fit) the NN model using the training data. The adaptive momentum (ADAM) optimizer along with categorical_crossentropy objective function may be used to compile the NN model. Specifying accuracy as a metrics argument allows us to inspect the quality of the NN model fit during the training phase (training data validation). The optimizer and the objective (loss) function are the pair of require arguments for model compilation.
In addition to ADAM, alternative optimization algorithms include Stochastic Gradient Descent (SGD) and Root Mean Square proportion (RMSprop). ADAM is essentially RMSprop with momentum whereas NADAM is ADAM RMSprop with Nesterov momentum. Following the selection of the optimization algorithm, we need to tune the model parameters, e.g., learning rate or momentum. Choosing an appropriate objective function function depends on the classification or regression forecasting task, e.g., regression prediction (continuous outcomes) usually utilize Mean Squared Error (MSE), whereas multi-class classification problems use categorical_crossentropy loss function and binary classification problems commonly use binary_crossentropy loss function.
# "Compile"" the model
model.1 %>% compile(
loss = 'binary_crossentropy',
optimizer = 'adam',
metrics = 'accuracy'
)The next step fits the NN model (model.1) to the training data using 200 epochs, or iterations over all the samples in train_dat2.X (predictors) and train_dat2.Y (outcomes), in batches of 10 samples. This process trains the model a specified number of epochs (iterations or exposures) on the training data. One epoch is a single pass through the whole training set followed by comparing the model prediction results against the verification labels. The batch size defines the number of samples being propagated through the network at ones (as a batch).
# convert the labels to categorical values
train_dat2.Y <- to_categorical(train_dat2.Y)
test_dat2.Y <- to_categorical(test_dat2.Y)
# Fit the model & Store the fitting history
track.model.1 <- model.1 %>% fit(
train_dat2.X,
train_dat2.Y,
epochs = 200,
batch_size = 10,
validation_split = 0.2
)We can visualize the model fitting process using keras::plot() jointly depicting the loss of the objective function and the accuracy of the model, across epochs. Alternatively, we can split the pair of plots - one for the model loss and the other one for the model accuracy. The \(\$\) operator is used to access the tensor data and plot it step-by-step. A sign of overfitting may be an accuracy (on training data) that keeps improving while the accuracy (on the validation data) worsens. This may be in indication that the NN model starts to learn noise in the data instead of learning real patterns or affinities in the data. While the accuracy trends of both datasets are rising towards the final epochs, this may indicate that the model is still in the process of learning on the training dataset (and we can increase the number of epochs).
# Plot the history
plot(track.model.1)# NN model loss on the training data
plot(track.model.1$metrics$loss, main="Model 1 Loss",
xlab = "Epoch", ylab="Loss", col="green", type="l", ylim=c(0.54, 0.6))
# NN model loss of the 20% validation data
lines(track.model.1$metrics$val_loss, col="blue", type="l")
# Add legend
legend("right", c("train", "test"), col=c("green", "blue"), lty=c(1,1))# Plot the accuracy of the training data
plot(track.model.1$metrics$acc, main="Model 1 Accuracy",
xlab = "Epoch", ylab="Accuracy", col="blue", type="l", ylim=c(0.65, 0.75))
# Plot the accuracy of the validation data
lines(track.model.1$metrics$val_acc, col="green")
# Add Legend
legend("bottom", c("Training", "Testing"), col=c("blue", "green"), lty=c(1,1))Once the model is fit, we can use it to predict the survival of passengers using the testing data, test_dat2.X. As we have seen before predict() provides this functionality. Finally, we can evaluate the performance of the NN model by comparing the predicted class labels and test_dat2.Y using table() or confusionMatrix().
# Predict the classes for the test data
predict.survival <- model.1 %>% predict_classes(test_dat2.X, batch_size = 30)
# Confusion matrix
test_dat2.Y <- dat1[-train_set_ind , 8]
table(test_dat2.Y, predict.survival)## predict.survival
## test_dat2.Y 0 1
## 0 129 32
## 1 58 43
confusionMatrix(test_dat2.Y, as.factor(predict.survival))## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 129 32
## 1 58 43
##
## Accuracy : 0.6565
## 95% CI : (0.5956, 0.7138)
## No Information Rate : 0.7137
## P-Value [Acc > NIR] : 0.981625
##
## Kappa : 0.2384
## Mcnemar's Test P-Value : 0.008408
##
## Sensitivity : 0.6898
## Specificity : 0.5733
## Pos Pred Value : 0.8012
## Neg Pred Value : 0.4257
## Prevalence : 0.7137
## Detection Rate : 0.4924
## Detection Prevalence : 0.6145
## Balanced Accuracy : 0.6316
##
## 'Positive' Class : 0
##
We can also utilize the evaluate() function to assess the model quality using testing data.
# Evaluate on test data and labels
test_dat2.Y <- to_categorical(test_dat2.Y)
model1.qual <- model.1 %>% evaluate(test_dat2.X, test_dat2.Y, batch_size = 30)
print(model1.qual)## $loss
## [1] 0.6092251
##
## $acc
## [1] 0.6564886
The main NN model parameters we can adjust to improve the model quality include:
Models can be improved by adding additional layers, increasing the number of hidden units, and by tuning the optimization parameters in compile(). Let’s first try to add another layer to the N model.
# Initialize the sequential model
model.2 <- keras_model_sequential()
# Add layers to model
model.2 %>%
layer_dense(units = 8, activation = 'relu', input_shape = c(4)) %>%
layer_dense(units = 6, activation = 'relu') %>%
layer_dense(units = 2, activation = 'softmax')
# Compile the model
model.2 %>% compile(
loss = 'binary_crossentropy',
optimizer = 'adam',
metrics = 'accuracy'
)
# Fit NN model to training data & Save the training history
track.model.2 <- model.2 %>% fit(
train_dat2.X,
train_dat2.Y,
epochs = 200,
batch_size = 10,
validation_split = 0.2
)
# Evaluate the model
model2.qual <- model.2 %>% evaluate(test_dat2.X, test_dat2.Y, batch_size = 30)
print(model2.qual)## $loss
## [1] 0.6136642
##
## $acc
## [1] 0.6870229
# EDA on the loss and accuracy metrics of this model.2
# Plot the history
plot(track.model.2)# NN model loss on the training data
plot(track.model.2$metrics$loss, main="Model Loss",
xlab = "Epoch", ylab="Loss", col="green", type="l", ylim=c(0.54, 0.6))
# NN model loss of the 20% validation data
lines(track.model.2$metrics$val_loss, col="blue", type="l")
# Add legend
legend("right", c("Training", "Testing"), col=c("green", "blue"), lty=c(1,1))# Plot the accuracy of the training data
plot(track.model.2$metrics$acc, main="Model 2 (Extra Layer) Accuracy",
xlab = "Epoch", ylab="Accuracy", col="blue", type="l", ylim=c(0.65, 0.76))
# Plot the accuracy of the validation data
lines(track.model.2$metrics$val_acc, col="green")
# Add Legend
legend("top", c("Training", "Testing"), col=c("blue", "green"), lty=c(1,1))Next we can examine the effects of adding more hidden units to the NN model.
# Initialize a sequential model
model.3 <- keras_model_sequential()
# Add layers and Node-Units to model
model.3 %>%
layer_dense(units = 30, activation = 'relu', input_shape = c(4)) %>%
layer_dense(units = 15, activation = 'relu') %>%
layer_dense(units = 2, activation = 'softmax')
# Compile the model
model.3 %>% compile(
loss = 'binary_crossentropy',
optimizer = 'adam',
metrics = 'accuracy'
)
# Fit NN model to training data & Save the training history
track.model.3 <- model.3 %>% fit(
train_dat2.X,
train_dat2.Y,
epochs = 200,
batch_size = 10,
validation_split = 0.2
)
# Evaluate the model
model3.qual <- model.3 %>% evaluate(test_dat2.X, test_dat2.Y, batch_size = 30)
print(model3.qual)## $loss
## [1] 0.6251848
##
## $acc
## [1] 0.6679389
# EDA on the loss and accuracy metrics of this model.2
# Plot the history
plot(track.model.3)# NN model loss on the training data
plot(track.model.3$metrics$loss, main="Model Loss",
xlab = "Epoch", ylab="Loss", col="green", type="l", ylim=c(0.54, 0.7))
# NN model loss of the 20% validation data
lines(track.model.3$metrics$val_loss, col="blue", type="l")
# Add legend
legend("top", c("Training", "testing"), col=c("green", "blue"), lty=c(1,1))# Plot the accuracy of the training data
plot(track.model.3$metrics$acc, main="Model 3 (Extra Layer/More Hidden Units)",
xlab = "Epoch", ylab="Accuracy", col="blue", type="l", ylim=c(0.65, 0.76))
# Plot the accuracy of the validation data
lines(track.model.3$metrics$val_acc, col="green")
# Add Legend
legend("top", c("Training", "Testing"), col=c("blue", "green"), lty=c(1,1))Finally, we can attempt to fine-tune the optimization parameters provided to the compile() function. For instance, we can experiment with alternative optimization algorithms, like the Stochastic Gradient Descent (SGD), optimizer_sgd(), and adjust the learning rate, lr. Specifying alternative learning rate to train the NN, typically 10-fold increase or decrease, to trade accuracy, speed of convergence, and avoiding local minima.
model.4 <- keras_model_sequential()
# Add layers and Node-Units to model
model.4 %>%
layer_dense(units = 30, activation = 'relu', input_shape = c(4)) %>%
layer_dense(units = 15, activation = 'relu') %>%
layer_dense(units = 2, activation = 'softmax')
# Define an optimizer
SGD <- optimizer_sgd(lr = 0.001)
# Compile the model
model.4 %>% compile(
optimizer=SGD,
loss = 'binary_crossentropy',
metrics = 'accuracy'
)
# Fit NN model to training data & Save the training history
set.seed(1234)
track.model.4 <- model.4 %>% fit(
train_dat2.X,
train_dat2.Y,
epochs = 200,
batch_size = 10,
validation_split = 0.1
)
# Evaluate the model
model4.qual <- model.4 %>% evaluate(test_dat2.X, test_dat2.Y, batch_size = 30)
print(model4.qual)## $loss
## [1] 0.6246598
##
## $acc
## [1] 0.6603054
# EDA on the loss and accuracy metrics of this model.2
# Plot the history
plot(track.model.4)# NN model loss on the training data
plot(track.model.4$metrics$loss, main="Model 4 Loss",
xlab = "Epoch", ylab="Loss", col="green", type="l", ylim=c(0.54, 0.7))
# NN model loss of the 20% validation data
lines(track.model.4$metrics$val_loss, col="blue", type="l")
# Add legend
legend("top", c("Training", "Testing"), col=c("green", "blue"), lty=c(1,1))# Plot the accuracy of the training data
plot(track.model.4$metrics$acc, main="Model 4 (Extra Layer/More Hidden Units/SGD)",
xlab = "Epoch", ylab="Accuracy", col="blue", type="l", ylim=c(0.65, 0.76))
# Plot the accuracy of the validation data
lines(track.model.4$metrics$val_acc, col="green")
# Add Legend
legend("top", c("Training", "Testing"), col=c("blue", "green"), lty=c(1,1))Intermediate and final NN models may be saved, (re)loaded, and exported using of save_model_hdf5() and load_model_hdf5() based on the HDF5 file format (h5). We can operate on complete models on just on the model weights. the models can also be exported in JSON or YAML formats using model_to_json() and model_to_yaml(), and their load counterparts model_from_json() and model_from yaml().
save_model_hdf5(model.4, "model.4.h5")
model.new <- load_model_hdf5("model.4.h5")
save_model_weights_hdf5("model_weights.h5")
model.old %>% load_model_weights_hdf5("model_weights.h5")
json_string <- model_to_json(model.old)
model.new <- model_from_json(json_string)
yaml_string <- model_to_yaml(model.old)
model.new <- model_from_yaml(yaml_string)MXNetIn MXNet, a Multilayer perceptron (MLP) may be defined by:
data = mx.sym.Variable('data')fc1 = mx.sym.FullyConnected(data=data, name='fc1', num_hidden=128)relu function to the output of the first fully-connnected layer, act1 = mx.sym.Activation(data=fc1, name='relu1', act_type="relu")fc2 = mx.sym.FullyConnected(data=act1, name='fc2', num_hidden = 64); act2 = mx.sym.Activation(data=fc2, name='relu2', act_type="relu")fc3 = mx.sym.FullyConnected(data=act2, name='fc3', num_hidden=10)mlp = mx.sym.SoftmaxOutput(data=fc3, name='softmax'). See the mlp R source code here.Let’s load the mlbench and mlbench packages and demonstrate the basic invocation of mxnet. The Sonar data mlbench::Sonar includes sonar signals bouncing off a metal cylinder or a roughly cylindrical rock. Each of 208 patterns includes a set of 60 numbers (features) in the range 0.0 to 1.0, and a label M (metal) or R (rock). Each feature represents the energy within a particular frequency band, integrated over a certain period of time. The M and R labels associated with each observation classify the record as rock or mine (metal) cylinder. The numbers in the labels are in increasing order of aspect angle, but they do not encode the angle directly.
# Load the required packages: mlbench and mxnet
# install.packages("mlbench"); install.packages("mxnet")
# Note mxnet requires "visNetwork"
# If it doesn't work, you may need the following lines:
# install.packages("drat", repos="https://cran.rstudio.com")
# drat:::addRepo("dmlc")
# install.packages("mxnet")
require(mlbench)
require(mxnet)
data(Sonar, package="mlbench")
table(Sonar[,61])##
## M R
## 111 97
Sonar[,61] = as.numeric(Sonar[,61])-1 # R = "1", "M" = "0"
set.seed(123)
train.ind = sample(1:nrow(Sonar),0.7*nrow(Sonar))
train.x = data.matrix(Sonar[train.ind, 1:60])
train.y = Sonar[train.ind, 61]
test.x = data.matrix(Sonar[-train.ind, 1:60])
test.y = Sonar[-train.ind, 61]Let’s start by using a multi-layer perceptron as a classifier. The mxnet function mx.mlp builds a general multi-layer neural network that can be utilized to do classification or regression graph modeling. It relies on the following parameters:
mx.model.FeedForward.createHere is one example using the training and testing data we defined above:
mx.set.seed(1234)
model.mx <- mx.mlp(train.x, train.y, hidden_node=8, out_node=2, out_activation="softmax",
num.round=200, array.batch.size=15, learning.rate=0.1, momentum=0.9,
eval.metric=mx.metric.accuracy,verbose=F)
#calculate Prediction sensitivity & specificity
preds = predict(model.mx, test.x) # these are probabilities
# You can inspect the test labels vs. assigned probabilities by
# View(data.frame(test.y, preds[2,]))
preds1 <- ifelse(preds[2,] <= 0.5, 0, 1) # dichotomize to labels
table(preds1)## preds1
## 0 1
## 36 27
pred.label = t(preds1)
table(pred.label, test.y)## test.y
## pred.label 0 1
## 0 30 6
## 1 4 23
library("caret")
sensitivity(factor(preds1), factor(as.numeric(test.y)),positive = 1)## [1] 0.7931034
specificity(factor(preds1), factor(as.numeric(test.y)),negative = 0)## [1] 0.8823529
We can also use crossval::diagnosticErrors() and crossval::confusionMatrix() to get more detailed evaluations. Similar to sensitivity() and specificity(), we should specify the negative and positive labels.
Note that you have to specify crossval::confusionMatrix() if you also have the caret package loaded, as caret also has a function called confusionMatrix().
library("crossval")##
## Attaching package: 'crossval'
## The following object is masked from 'package:caret':
##
## confusionMatrix
diagnosticErrors(crossval::confusionMatrix(preds1,test.y, negative = 0))## acc sens spec ppv npv lor
## 0.8412698 0.8518519 0.8333333 0.7931034 0.8823529 3.3586378
## attr(,"negative")
## [1] 0
Now, we compare the results of different number of rounds, or epochs, representing the number of full (training-phase) passes through the data (cf. num.round=n).
mx.set.seed(1234)
get_pred = function(n){
model.mx <- mx.mlp(train.x, train.y, hidden_node=8, out_node=2, out_activation="softmax",
num.round=n, array.batch.size=15, learning.rate=0.1, momentum=0.9,
eval.metric=mx.metric.accuracy,verbose=F)
preds = predict(model.mx, test.x)
}
preds100 = get_pred(100)
preds50 = get_pred(50)
preds10 = get_pred(10)We can plot the ROC curve and calculate the AUC (Area under the curve):
# install.packages("pROC"); install.packages("plotROC"); install.packages("reshape2")
library(pROC); library(plotROC); library(reshape2); ## Warning: package 'pROC' was built under R version 3.5.1
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
## Warning: package 'plotROC' was built under R version 3.5.1
##
## Attaching package: 'plotROC'
## The following object is masked from 'package:pROC':
##
## ggroc
## Warning: package 'reshape2' was built under R version 3.5.1
##
## Attaching package: 'reshape2'
## The following objects are masked from 'package:reshape':
##
## colsplit, melt, recast
# compute AUC
get_roc = function(preds){
roc_obj <- roc(test.y, preds[2,])
auc(roc_obj)
}
get_roc(preds)## Area under the curve: 0.9087
get_roc(preds100)## Area under the curve: 0.9026
get_roc(preds50)## Area under the curve: 0.9087
get_roc(preds10)## Area under the curve: 0.7576
#plot roc
dt <- data.frame(test.y, preds[2,], preds100[2,], preds50[2,], preds10[2,])
colnames(dt) <- c("D","rounds=200","rounds=100","rounds=50","rounds=10")
dt <- melt(dt,id.vars = "D")
basicplot <- ggplot(dt, aes(d = D, m = value, colour=variable)) + geom_roc(labels = FALSE, size = 0.5, alpha.line = 0.6, linejoin = "mitre") +
theme_bw() + coord_fixed(ratio = 1) + style_roc() + ggtitle("ROC CURVE")+
ggplot2::annotate("rect", xmin = 0.4, xmax = 0.9, ymin = 0.1, ymax = 0.5,
alpha = 0.2)+
ggplot2::annotate("text", x = 0.65, y = 0.32, size = 3,
label = "AUC: \n rounds=200: 0.9209\n rounds=100: 0.9128\n rounds=50: 0.8824\n rounds=10: 0.8022\n " )
basicplotThe plot suggests that the results stabilize after 100 training (epoch) iterations.
Let’s look at some visualizations of the real labels of the test data (test.y) and their corresponding ML-derived classification labels (preds[2,]) using 200 iterations.
graph.viz(model.mx$symbol)hist(preds10[2,],main = "rounds=10")hist(preds50[2,],main = "rounds=50")hist(preds100[2,],main = "rounds=100")hist(preds[2,],main = "rounds=200")We see a significant bimodal trend when the number of rounds increases. Another plot may be generated by:
library(ggplot2)
get_gghist = function(preds){
ggplot(data.frame(test.y, preds),
aes(x=preds, group=test.y, fill=as.factor(test.y)))+
geom_histogram(position="dodge",binwidth=0.25)+theme_bw()
}
df = data.frame(preds[2,],preds100[2,],preds50[2,],preds10[2,])
p <- lapply(df,get_gghist)
require(gridExtra) # used for arrange ggplots## Loading required package: gridExtra
grid.arrange(p$preds10.2...,p$preds50.2...,p$preds100.2...,p$preds.2...)The mx.mlp() function is a proxy to the more complex and laborious process of defining a neural network by using MXNet’s Symbol. For instance, this call model.mx <- mx.mlp(train.x, train.y, hidden_node=8, out_node=2, out_activation="softmax", num.round=20, array.batch.size=15, learning.rate=0.1, momentum=0.9, eval.metric=mx.metric.accuracy) would be equivalent to a symbolic network definition like: data <- mx.symbol.Variable("data"); fc1 <- mx.symbol.FullyConnected(data, num_hidden=128) act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu"); fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=64); act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu"); fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=2); lro <- mx.symbol.SoftmaxOutput(fc3, name="sm"); model2 <- mx.model.FeedForward.create(lro, X=train.x, y=train.y, ctx=mx.cpu(), num.round=100, array.batch.size=15, learning.rate=0.07, momentum=0.9) (see example with linear regression below).
mx.symbol:mx.symbol.Convolution applies convolution to the input and then adds a bias. It can create convolutional neural networks.mx.symbol.Deconvolution does the opposite and can be used in segmentation networks along with mx.symbol.UpSampling, e.g., to reconstruct the pixel-wise classification of an image.mx.symbol.Pooling reduces the data by selecting signals with the highest response.mx.symbol.Flatten links convolutional and pooling layers to form a fully connected network.mx.symbol.Dropout attempts to cope with the overfitting problem.
The function mx.mlp() is a wrapper for quick design of standard multi-layer perceptrons. For more extensive experiments, customized symbolic representation can be explicitly specified using combinations of the above methods.
# Example of **LeNet** network for recognizing handwritten digits:
data <- mx.symbol.Variable('data')
conv1 <- mx.symbol.Convolution(data=data, kernel=c(5,5), num_filter=20)
tanh1 <- mx.symbol.Activation(data=conv1, act_type="tanh")
pool1 <- mx.symbol.Pooling(data=tanh1, pool_type="max", kernel=c(2,2), stride=c(2,2))
conv2 <- mx.symbol.Convolution(data=pool1, kernel=c(5,5), num_filter=50)
tanh2 <- mx.symbol.Activation(data=conv2, act_type="tanh")
pool2 <- mx.symbol.Pooling(data=tanh2, pool_type="max", kernel=c(2,2), stride=c(2,2))
flatten <- mx.symbol.Flatten(data=pool2)
fc1 <- mx.symbol.FullyConnected(data=flatten, num_hidden=500)
tanh3 <- mx.symbol.Activation(data=fc1, act_type="tanh")
fc2 <- mx.symbol.FullyConnected(data=tanh3, num_hidden=10)
lenet <- mx.symbol.SoftmaxOutput(data=fc2)
model <- mx.model.FeedForward.create(lenet, X=train.x, y=train.y, ctx=device.cpu, num.round=5, array.batch.size=100, learning.rate=0.05, momentum=0.9)To allow smooth, fast, and consistent operation on CPU and GPU, in in mxnet, the generic R function controlling the reproducibility of stochastic results is overwritten by mx.set.seed. So can use mx.set.seed() to control random numbers in MXNet.
To examine the accuracy of the model.mx learner (trained on the training data), we can make prediction (on testing data) and evaluate the results using the provided testing labels (report the confusion matrix).
preds = predict(model.mx, test.x)
pred.label = max.col(t(preds))-1
table(pred.label, test.y)## test.y
## pred.label 0 1
## 0 30 6
## 1 4 23
For multi-class predictions, mxnet outputs \(n\ (class) \times m\ (examples)\) confusion matrices where each row corresponds to probability of the corresponding (column-defined) class.
Let’s demonstrate regression and prediction deep learning examples using several complementary datasets.
Let’s first demonstrate a deep learning regression using the ALS data to predict ALSFRS_slope.
als <- read.csv("https://umich.instructure.com/files/1789624/download?download_frd=1")
als <- scale(als[,-c(1,7)])
train.ind = sample(1:nrow(als),0.7*nrow(als))
train.x = data.matrix(als[train.ind,-c(1,7)])
train.y = als[train.ind,7]
test.x = data.matrix(scale(als[-train.ind,-c(1,7)]))
test.y = als[-train.ind,7]
# Define the input data
data <- mx.symbol.Variable("data")
# A fully connected hidden layer
# data: input source
# num_hidden: number of neurons in this hidden layer
fc1 <- mx.symbol.FullyConnected(data, num_hidden=1)
# Use linear regression for the output layer
lro <- mx.symbol.LinearRegressionOutput(fc1)
mx.set.seed(1234)
# Create a MXNet Feedforward neural net model with the specified training.
model <- mx.model.FeedForward.create(lro, X=train.x, y=train.y,
ctx=mx.cpu(), num.round=1000, array.batch.size=20,
learning.rate=2e-6, momentum=0.9, eval.metric=mx.metric.rmse,verbose = F)The option verbose = F can suppress messages including training accuracy report in each iteration. You may rerun the code with verbose = T to examine the rate of decrease of train error against the number of iterations.
You must scale data before input it into MXnet, which expects that the training and testing sets are normalized to the same scale. There are two strategies to scale the data.
Have a look at the Google TensorFlow API. It shows the importance of learning rate and the number of rounds. You should test different sets of parameters.
preds = predict(model, test.x)## Warning in mx.model.select.layout.predict(X, model): Auto detect layout of input matrix, use rowmajor..
sqrt(mean((preds-test.y)^2))## [1] 0.2201744
range(test.y)## [1] -2.880005 1.943890
# plot the correlation between testdata.y and testdata.predicted.y
plot(preds, test.y)We can see that the RMSE on the test set is pretty small. To get a visual representation of the deep learning network we can also display this relatively simple computation graph:
graph.viz(model$symbol)We can again use the mx.mlp wrapper to construct the learning network, but we can also use a more flexible way to construct and configure the multi-layer network in mxnet. This configuration is done by using the Symbol call, which specifies the links among network nodes, the activation function, dropout ratio, and so on:
Below we show configuration of perceptron with one hidden layer.
########### Network configuration
# variables
act <- mx.symbol.Variable("data")
# affine transformation
fc <- mx.symbol.FullyConnected(act, num.hidden = 10)
# non-linear activation
act <- mx.symbol.Activation(data = fc, act_type = "relu")
# affine transformation
fc <- mx.symbol.FullyConnected(act, num.hidden = 2)
# softmax output and cross-
mlp <- mx.symbol.SoftmaxOutput(fc)
####Preparing data
set.seed(2235)
############ spirals dataset
s <- sample(x = c("train", "test"), size = 1000, prob = c(.8,.2), replace = TRUE)
dta <- mlbench.spirals(n = 1000, cycles = 1.2, sd = .03)
dta <- cbind(dta[["x"]], as.integer(dta[["classes"]]) - 1)
colnames(dta) <- c("x","y","label")
######### train, validate, test
dta.train <- dta[s == "train",]
dta.test <- dta[s == "test",]Let’s display the data and examine its structure.
dt <- as.data.frame(dta);dt[,3] <- as.factor(dt[,3])
dt.train <- dt[s == "train",]
dt.test <- dt[s == "test",]
p1 <- ggplot(dt,aes(x = x,y = y,color=label))+geom_point()+ggtitle("Whole data structure")
p2 <- ggplot(dt.train,aes(x = x,y = y,color=label))+geom_point()+ggtitle("Train data structure")
p3 <- ggplot(dt.test,aes(x = x,y = y,color=label))+geom_point()+ggtitle("Test data structure")
grid.arrange(p1,p2,p3,nrow=3)# Network training
# Feed-forward networks may be trained using iterative gradient descent algorithms. A **batch** is a subset of data that is used during single forward pass of the algorithm. An **epoch** represents one step of the iterative process that is repeated until all training examples are used.
############ basic spiral-data training
mx.set.seed(2235)
model <- mx.model.FeedForward.create(
symbol = mlp,
X = dta.train[, c("x", "y")],
y = dta.train[, c("label")],
num.round = 500,
array.layout = "rowmajor",
learning.rate = 1,
eval.metric = mx.metric.accuracy,verbose = F)
preds = predict(model, dta.test[,c(1:2)])## Warning in mx.model.select.layout.predict(X, model): Auto detect layout of input matrix, use rowmajor..
pred.label = max.col(t(preds))-1; table(pred.label, dta.test[,3])##
## pred.label 0 1
## 0 110 24
## 1 2 79
The prediction result is close to perfect and we can inspect deeper the results using crossval::confusionMatrix.
library("crossval")
diagnosticErrors(crossval::confusionMatrix(pred.label,dta.test[,3],negative = 0))## acc sens spec ppv npv lor
## 0.8790698 0.9753086 0.8208955 0.7669903 0.9821429 5.1987272
## attr(,"negative")
## [1] 0
ggplot(data.frame(dta.test[,3], preds[2,]),
aes(x=preds[2,], group=dta.test[,3], fill=as.factor(dta.test[,3])))+
geom_histogram(position="dodge",binwidth=0.25)+theme_bw()Once we fit a model (like the binary label classification below), we can:
# define a custom call-back, which stops the process of training when the progress in accuracy is below certain level of tolerance. It call is made after every epoch to check the status of convergence of the algorithm.
mx.callback.train.stop <- function(tol = 1e-3, mean.n = 1e2, period = 100, min.iter = 100) {
function(iteration, nbatch, env, verbose = TRUE) {
if (nbatch == 0 & !is.null(env$metric)) {
continue <- TRUE
acc.train <- env$metric$get(env$train.metric)$value
if (is.null(env$acc.log)) {
env$acc.log <- acc.train
} else {
if ((abs(acc.train - mean(tail(env$acc.log, mean.n))) < tol &
abs(acc.train - max(env$acc.log)) < tol &
iteration > min.iter) |
acc.train == 1) {
cat("Training finished with final accuracy: ",
round(acc.train * 100, 2), " %\n", sep = "")
continue <- FALSE
}
env$acc.log <- c(env$acc.log, acc.train)
}
}
if (iteration %% period == 0) {
cat("[", iteration,"]"," training accuracy: ",
round(acc.train * 100, 2), " %\n", sep = "")
}
return(continue)
}
}
###### training with custom stopping
mx.set.seed(2235)
model <- mx.model.FeedForward.create(
symbol = mlp,
X = dta.train[, c("x", "y")],
y = dta.train[, c("label")],
num.round = 2000,
array.layout = "rowmajor",
learning.rate = 1,
epoch.end.callback = mx.callback.train.stop(),
eval.metric = mx.metric.accuracy,
verbose = FALSE
) ## [100] training accuracy: 84.04 %
## [200] training accuracy: 90.62 %
## [300] training accuracy: 92.75 %
## [400] training accuracy: 94.31 %
## [500] training accuracy: 89.29 %
## [600] training accuracy: 94.2 %
## [700] training accuracy: 94.87 %
## [800] training accuracy: 94.87 %
## Training finished with final accuracy: 94.98 %
labeled_spiral_data <- as.data.frame(cbind(dta.test[, c("x", "y")], as.factor(pred.label)))
colnames(labeled_spiral_data) <- c("x", "y", "label")
labeled_spiral_data$label <- as.factor(labeled_spiral_data$label)
p4 <- ggplot(labeled_spiral_data, aes(x = x,y = y,color=label))+geom_point()+ggtitle("Structure of Predicted-Labels on Test Data")
p4Let’s try another example using the IBS NI Study data.
# IBS NI Data
library(xml2)
library(rvest)
# UCLA Data
wiki_url <- read_html("http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_April2011_NI_IBS_Pain")
IBSData<- html_table(html_nodes(wiki_url, "table")[[2]]) # table 2
set.seed(1234)
test.ind = sample(1:354, 50, replace = F) # select 50/354 of cases for testing, train on remaining (354-50)/354 cases
# UMich Data (includes MISSING data): use `mice` to impute missing data with mean: newData <- mice(data,m=5,maxit=50,meth='pmm',seed=500); summary(newData)
# wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data_April2011_NI_IBS_Pain")
# IBSData<- html_table(html_nodes(wiki_url, "table")[[1]]) # load Table 1
# set.seed(1234)
# test.ind = sample(1:337, 50, replace = F) # select 50/337 of cases for testing, train on remaining (337-50)/337 cases
# summary(IBSData); IBSData[IBSData=="."] <- NA; newData <- mice(IBSData,m=5,maxit=50,meth='pmm',seed=500); summary(newData)
html_nodes(wiki_url, "#content")## {xml_nodeset (1)}
## [1] <div id="content">\n\t\t<a name="top" id="top"></a>\n\t\t\t\t<h1 id= ...
# View (IBSData); dim(IBSData): Select an outcome response "DX"(3), "FS_IQ" (5)
# scale/normalize all input variables
IBSData <- na.omit(IBSData)
IBSData[,4:66] <- scale(IBSData[,4:66]) # scale the entire dataset
train.x = data.matrix(IBSData[-test.ind, c(4:66)]) # exclude outcome
train.y = IBSData[-test.ind, 3]-1
test.x = data.matrix(IBSData[test.ind, c(4:66)])
test.y = IBSData[test.ind, 3]-1
# View(data.frame(train.x, train.y))
# View(data.frame(test.x, test.y))
# table(test.y); table(train.y)
# num.round - number of iterations to train the model
act <- mx.symbol.Variable("data")
fc <- mx.symbol.FullyConnected(act, num.hidden = 10)
act <- mx.symbol.Activation(data = fc, act_type = "sigmoid")
fc <- mx.symbol.FullyConnected(act, num.hidden = 2)
mlp <- mx.symbol.SoftmaxOutput(fc)
mx.set.seed(2235)
model <- mx.model.FeedForward.create(
symbol = mlp,
array.batch.size=20,
X = train.x, y=train.y,
num.round = 200,
array.layout = "rowmajor",
learning.rate = exp(-1),
eval.metric = mx.metric.accuracy, verbose=FALSE)
preds = predict(model, test.x)
pred.label = max.col(t(preds))-1; table(pred.label, test.y)## test.y
## pred.label 0 1
## 0 26 11
## 1 7 6
library("crossval")
diagnosticErrors(crossval::confusionMatrix(pred.label,test.y,negative = 0))## acc sens spec ppv npv lor
## 0.6400000 0.4615385 0.7027027 0.3529412 0.7878788 0.7060506
## attr(,"negative")
## [1] 0
ggplot(data.frame(test.y, preds[2,]),
aes(x=preds[2,], group=test.y, fill=as.factor(test.y)))+
geom_histogram(position="dodge",binwidth=0.25)+theme_bw()# convert pred-probability to binary classes threshold=0.3?
bin_preds <- ifelse (preds[2,]<0.3, 0, 1)
# get a factor variable comparing binary test-labels vs. predcted-labels
label_match <- as.factor(ifelse (test.y==bin_preds, 0, 1))
p5 <- ggplot(data.frame(test.y, preds[2,]), aes(x = test.y, y = preds[2,], color=label_match))+geom_point()+ggtitle("Match between Test Data Labels and Predicted Labels")
p5This histogram plot suggests that the classification is not good.
Another case-study we have seen befre is the country quality of life (QoL) dataset. Let’s try to extime a netowkr model and use it to predict the overall QoL.
wiki_url <- read_html("http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_2008_World_CountriesRankings")
html_nodes(wiki_url, "#content")## {xml_nodeset (1)}
## [1] <div id="content">\n\t\t<a name="top" id="top"></a>\n\t\t\t\t<h1 id= ...
CountryRankingData<- html_table(html_nodes(wiki_url, "table")[[2]])
# View (CountryRankingData); dim(CountryRankingData): Select an appropriate
# outcome "OA": Overall country ranking (13)
# Dichotomize outcome, Top-countries OA<20, bottom countries OA>=20
set.seed(1234)
test.ind = sample(1:100, 30, replace = F) # select 15/100 of cases for testing, train on remaining 85/100 cases
CountryRankingData[,c(8:12,14)] <- scale(CountryRankingData[,c(8:12,14)])
# scale/normalize all input variables
train.x = data.matrix(CountryRankingData[-test.ind, c(8:12,14)]) # exclude outcome
train.y = ifelse(CountryRankingData[-test.ind, 13] < 50, 1, 0)
test.x = data.matrix(CountryRankingData[test.ind, c(8:12,14)])
test.y = ifelse(CountryRankingData[test.ind, 13] < 50, 1, 0) # developed (high OA rank) country
# View(data.frame(train.x, train.y)); View(data.frame(test.x, test.y))
# View(data.frame(CountryRankingData, ifelse(CountryRankingData[,13] < 20, 1, 0)))
act <- mx.symbol.Variable("data")
fc <- mx.symbol.FullyConnected(act, num.hidden = 10)
act <- mx.symbol.Activation(data = fc, act_type = "sigmoid")
fc <- mx.symbol.FullyConnected(act, num.hidden = 2)
mlp <- mx.symbol.SoftmaxOutput(fc)
mx.set.seed(2235)
model <- mx.model.FeedForward.create(
symbol = mlp,
array.batch.size=10,
X = train.x, y=train.y,
num.round = 15,
array.layout = "rowmajor",
learning.rate = exp(-1),
eval.metric = mx.metric.accuracy)
preds = predict(model, test.x); preds## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.4858414 0.8813806 0.005079126 0.00586758 0.8596867 0.8433116
## [2,] 0.5141586 0.1186193 0.994920909 0.99413246 0.1403132 0.1566884
## [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] 0.4339298 0.6440557 0.98306561 0.6904048 0.98700458 0.9694005 0.58468
## [2,] 0.5660702 0.3559443 0.01693436 0.3095952 0.01299545 0.0305995 0.41532
## [,14] [,15] [,16] [,17] [,18] [,19]
## [1,] 0.91731000 0.8294029 0.01788974 0.7494019 0.01260699 0.004735005
## [2,] 0.08269002 0.1705971 0.98211020 0.2505980 0.98739308 0.995265007
## [,20] [,21] [,22] [,23] [,24] [,25]
## [1,] 0.7692738 0.9511481 0.004696225 0.07467786 0.004864129 0.03685245
## [2,] 0.2307262 0.0488519 0.995303750 0.92532218 0.995135844 0.96314758
## [,26] [,27] [,28] [,29] [,30]
## [1,] 0.006151678 0.91600549 0.159912 0.9165047 0.97908646
## [2,] 0.993848383 0.08399452 0.840088 0.0834953 0.02091362
pred.label = max.col(t(preds))-1; table(pred.label, test.y)## test.y
## pred.label 0 1
## 0 17 0
## 1 1 12
We only need 15 rounds to achieve 97% accuracy.
ggplot(data.frame(test.y, preds[2,]),
aes(x=preds[2,], group=test.y, fill=as.factor(test.y)))+
geom_histogram(position="dodge",binwidth=0.25)+theme_bw()#calculate sensitivity & specificity and more
library("crossval")
diagnosticErrors(crossval::confusionMatrix(pred.label,test.y,negative = 0))## acc sens spec ppv npv lor
## 0.9666667 0.9230769 1.0000000 1.0000000 0.9444444 Inf
## attr(,"negative")
## [1] 0
# convert pred-probability to binary classes threshold=0.5?
bin_preds <- ifelse (preds[2,]<0.5, 0, 1)
# get a factor variable comparing binary test-labels vs. predicted-labels
label_match <- as.factor(ifelse (test.y==bin_preds, 0, 1))
p6 <- ggplot(data.frame(test.y, preds[2,]), aes(x = test.y, y = preds[2,], color=label_match))+geom_point()+ggtitle("Match between Test Data Labels and Predicted Labels")
p6In Chapter 10 (ML, NN, SVM Classification) we discussed Optical Character Recognition (OCR). Specifically, we analyzed handwritten notes (unstructured text) and converted it to printed text.
MNIST includes a large set of human annotated/labeled handwritten digits imaging data set. Every digit is represented by a \(28\times 28\) thumbnail image. You can download the training and testing data from Kaggle.
The train.csv and test.csv data files contain gray-scale images of hand-drawn digits, \(0, 1, 2, \ldots, 9\). Each 2D image is \(28\times 28\) in size and each of the \(784\) pixels has a single pixel-intensity representing the lightness or darkness of that pixel (stored as a 1 byte integer \([0,255]\)). Higher intensities correspond to darker pixels.
The training data, train.csv, has 785 columns, where the first column, label, codes the actual the digit drawn by the user. The remaining \(784\) columns contain the \(28\times 28=784\) pixel-intensities of the associated 2D image. Columns in the training set have \(pixel_K\) names, where \(0\leq K\leq 783\). To reconstruct a 2D image out of each row in the training data we use this relation between pixel-index (\(K\)) and \(X,Y\) image coordinates: \[K = Y \times 28 + X,\] where \(0\leq X, Y\leq 27\). Thus, \(pixel_K\) is located on row \(Y\) and column \(X\) of the corresponding 2D Image of size \(28\times 28\). For instance, \(pixel_{60=(2\times 28 + 4)} \longleftrightarrow (X=4,Y=2)\) represents the pixel on the 3-rd row and 5-th column in the image. Diagrammatically, omitting the “pixel” prefix, the pixels may be ordered to reconstruct the 2D image as follows:
| Row | Col0 | Col1 | Col2 | Col3 | Col4 | … | Col26 | Co27 |
|---|---|---|---|---|---|---|---|---|
| Row0 | 000 | 001 | 002 | 003 | 004 | … | 026 | 027 |
| Row1 | 028 | 029 | 030 | 031 | 032 | … | 054 | 055 |
| Row2 | 056 | 057 | 058 | 059 | 060 | … | 082 | 083 |
| RowK | … | … | … | … | … | … | … | … |
| Row26 | 728 | 729 | 730 | 731 | 732 | … | 754 | 755 |
| Row27 | 756 | 757 | 758 | 759 | 760 | … | 782 | 783 |
Note that the point-to-pixelID transformation (\(K = Y \times 28 + X\)) may easily be inverted as a pixelID-to-point mapping: \(X= K\ mod\ 28\) (remainder of the integer division (\(K/28\)) and \(Y=K %/%28\) (integer part of the division \(K/28\))). For example:
K <- 60
X <- K %% 28 # X= K mod 28, remainder of integer division 60/28
Y <- K%/%28 # integer part of the division
# This validates that the application of both, the back and forth transformations, leads to an identity
K; X; Y; Y * 28 + X## [1] 60
## [1] 4
## [1] 2
## [1] 60
The test data (test.csv) has the same organization as the training data, except that it does not contain the first label column. It includes 28,000 images and we can predict image labels that can be stored as \(ImageId,Label\) pairs, which can be visually compared to the 2D images for validation/inspection.
require(mxnet)
# train.csv
pathToZip <- tempfile()
download.file("http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/data/DigitRecognizer_TrainingData.zip", pathToZip)
train <- read.csv(unzip(pathToZip))
dim(train)## [1] 42000 785
unlink(pathToZip)
# test.csv
pathToZip <- tempfile()
download.file("http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/data/DigitRecognizer_TestingData.zip", pathToZip)
test <- read.csv(unzip(pathToZip))
dim(test)## [1] 28000 784
unlink(pathToZip)
train <- data.matrix(train)
test <- data.matrix(test)
train.x <- train[,-1]
train.y <- train[,1]
# Scaling will be discussed below
train.x <- t(train.x/255)
test <- t(test/255)Let’s look at some of these example images:
library("imager")
# first convert the CSV data (one row per image, 28,000 rows)
array_3D <- array(test, c(28, 28, 28000))
mat_2D <- matrix(array_3D[,,1], nrow = 28, ncol = 28)
plot(as.cimg(mat_2D))# extract all N=28,000 images
N <- 28000
img_3D <- as.cimg(array_3D[,,], 28, 28, N)
# plot the k-th image (1<=k<=N)
k <- 5
plot(img_3D, k)image_2D <- function(img,index)
{
img[,,index,,drop=FALSE]
}
plot(image_2D(img_3D, 1))# Plot a collage of the first 4 images
imappend(list(image_2D(img_3D, 1), image_2D(img_3D, 2), image_2D(img_3D, 3), image_2D(img_3D, 4)),"y") %>% plot# img <- image_2D(img_3D, 1)
# for (i in 10:20) { imappend(list(img, image_2D(img_3D, i)),"x") }In these CSV data files, each \(28\times 28\) image is represented as a single row. Greyscale images are \(1\) byte, in the range \([0, 255]\), which we linearly transformed into \([0,1]\). Note that we only scale the \(X\) input, not the output (labels). Also, we don’t have manual gold-standard validation labels for the testing data, i.e., test.y is not available for the handwritten digits data.
# We already scaled earlier
# train.x <- t(train.x/255)
# test <- t(test/255)Next, we can transpose the input matrix to \(n\ (pixels) \times m\ (examples)\), as column major format required by mxnet. The image labels are evenly distributed:
table(train.y); prop.table(table(train.y))## train.y
## 0 1 2 3 4 5 6 7 8 9
## 4132 4684 4177 4351 4072 3795 4137 4401 4063 4188
## train.y
## 0 1 2 3 4 5
## 0.09838095 0.11152381 0.09945238 0.10359524 0.09695238 0.09035714
## 6 7 8 9
## 0.09850000 0.10478571 0.09673810 0.09971429
The majority class (1) in the training set includes 11.2% of the observations.
data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128)
act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=64)
act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu")
fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=10)
softmax <- mx.symbol.SoftmaxOutput(fc3, name="sm")data <- mx.symbol.Variable(“data”) represents the input layer. The first hidden layer, set by fc1 <- mx.symbol.FullyConnected(data, name=“fc1”, num_hidden=128), takes the data as an input, its name, and the number of hidden neurons to generate an output layer.
act1 <- mx.symbol.Activation(fc1, name=“relu1”, act_type=“relu”) sets the activation function, which takes the output from the first hidden layer “fc1”" and generate an output that is fed into the second hidden layer “fc2” that uses fewer hidden neurons (64).
The process repeats with the second activation “act2”, resembling “act1” but using different input source and name. As there are only 10 digits (\(0, 1, ..., 9\)), in the last layer “fc3”, we set the number of neurons to 10. At the end, we set the activation to SoftmaxOutput to obtain probabilistic prediction outcomes.
We are almost ready for the training process. Before we start the computation, let’s decide what device should we use.
devices <- mx.cpu()Here we assign CPU to mxnet. After all these preparations, we can run the following command to train the neural network! Note that in mxnet, the correct function to control the random process is mx.set.seed.
mx.set.seed(1234)
model <- mx.model.FeedForward.create(softmax, X=train.x, y=train.y,
ctx=devices, num.round=10, array.batch.size=100,
learning.rate=0.07, momentum=0.9, eval.metric=mx.metric.accuracy,
initializer=mx.init.uniform(0.07),
epoch.end.callback=mx.callback.log.train.metric(100)
)For 10 rounds, the training accuracy is exceeds 99% it may not be worthwhile trying 100 rounds as this would increase substantially the computational complexity.
Next, we will generate forecasting using the model on a testing data, and evaluate the prediction performance. The preds matrix has \(28,000\) rows and \(10\) columns, containing the desired classification probabilities from the output layer of the neural net. To extract the maximum label for each row, we can use the max.col:
# evaludate: "preds" is the matrix of the possibility of each of the 10 numbers
preds <- predict(model, test)
pred.label <- max.col(t(preds)) - 1 # "-1", as the digits are 0, 1, ..., 9 (excluding 10)
table(pred.label)## pred.label
## 0 1 2 3 4 5 6 7 8 9
## 2740 3177 2788 2858 2639 2472 2725 2881 2782 2938
# preds1 <- ifelse(preds[2,] <= 0.5, 0, 1) # dichotomize to labels
# pred.label = t(preds1)
# table(pred.label, test.y)
# calculate sensitivity & specificity
# sensitivity(factor(preds1), factor(as.numeric(test.y)),positive = 1)
# specificity(factor(preds1), factor(as.numeric(test.y)),negative = 0)
# preds <- predict(model, test.x)
# dim(preds)
# preds1 <- ifelse(preds[2,] <= 0.5, 0, 1) # dichotomize to labels
# pred.label = t(preds1)
# table(pred.label, test.y)For binary classification, mxnet outputs two prediction classes, whereas for multi-class predictions, it outputs \(n\ (classes) \times m\ (examples)\) matrix, where the rows correspond to the probability of the class in the specific column, so all column sums add up to \(1.0\).
The predictions are stored in a \(28,000(rows) \times 10 (colums)\) matrix, including the desired classification probabilities from the output layer. The R max.col function extracts the maximum label for each row.
pred.label <- max.col(t(preds)) - 1
table(pred.label)## pred.label
## 0 1 2 3 4 5 6 7 8 9
## 2740 3177 2788 2858 2639 2472 2725 2881 2782 2938
We can save the predicted labels of the testing handwritten digits to CSV:
predicted_lables <- data.frame(ImageId=1:ncol(test), Label=pred.label)
write.csv(predicted_lables, file='predicted_lables.csv', row.names=FALSE, quote=FALSE)We can open the predicted_lables.csv file and inspect the ML-labels (saved in the 2-column ImageID and Label format CSV) assigned to the \(28,000\) manually drawn digits. As the testing handwritten digits data do not have human-provided labels, we can’t quantitatively assess the validity of the algorithm on the testing data. However, we can visually inspect random handwritten digit instances (\(7\) in the example below, image indices \(4:10\)) against their predictions and gain intuition of the accuracy rate of the ML classifier.
table(train.y)## train.y
## 0 1 2 3 4 5 6 7 8 9
## 4132 4684 4177 4351 4072 3795 4137 4401 4063 4188
table(predicted_lables[,2])##
## 0 1 2 3 4 5 6 7 8 9
## 2740 3177 2788 2858 2639 2472 2725 2881 2782 2938
# Plot the relative frequencies between the number of train.y labels (in range 0-9) against the testing data predicted labels.
# These are not directly related (training.y vs. predicted.testing.data!
# Remember - we don't have gold-standard test.y labels!
# Generally speaking, numbers closer to the diagonal suggest expected good classifications.
# Whereas, off diagonal points may suggest less effective labeling.
label.names <- c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9")
plot(ftable(train.y)[c(1:10)], ftable(predicted_lables[,2])[c(1:10)])
text(ftable(train.y)[c(1:10)]+20, ftable(predicted_lables[,2])[c(1:10)], labels=label.names, cex= 1.2)# For example, the ML-classification labels assigned to the first 7 images (from the 28,000 testing data collection) are:
head(predicted_lables, n = 7L)## ImageId Label
## 1 1 2
## 2 2 0
## 3 3 9
## 4 4 9
## 5 5 3
## 6 6 7
## 7 7 0
library(knitr)
kable(head(predicted_lables, n = 7L), format = "markdown", align='c')| ImageId | Label |
|---|---|
| 1 | 2 |
| 2 | 0 |
| 3 | 9 |
| 4 | 9 |
| 5 | 3 |
| 6 | 7 |
| 7 | 0 |
#initialize a list of m=7 images from the N=28,000 available images
m_start <- 4
m_end <- 10
if (m_end <= m_start)
{ m_end = m_start+1 } # check that m_end > m_start
label_Ypositons <- vector() # initialize the array of label positions on the plot
for (i in m_start:m_end) {
if (i==m_start) {
img1 <- image_2D(img_3D, m_start)
}
else img1 <- imappend(list(img1, image_2D(img_3D, i)),"y")
label.names[i+1-m_start] <- predicted_lables[i,2]
label_Ypositons[i+1-m_start] <- 15 + 28*(i-m_start)
}
plot(img1, axes=FALSE)
text(40, label_Ypositons, labels=label.names[1:(m_end-m_start)], cex= 1.2, col="blue")
mtext(paste((m_end+1-m_start), " Random Images \n Indices (m_start=", m_start, " : m_end=", m_end, ")"), side=2, line=-6, col="black")
mtext("ML Classification Labels", side=4, line=-5, col="blue") table(ftable(train.y)[c(1:10)], ftable(predicted_lables[,2])[c(1:10)])##
## 2472 2639 2725 2740 2782 2788 2858 2881 2938 3177
## 3795 1 0 0 0 0 0 0 0 0 0
## 4063 0 0 0 0 1 0 0 0 0 0
## 4072 0 1 0 0 0 0 0 0 0 0
## 4132 0 0 0 1 0 0 0 0 0 0
## 4137 0 0 1 0 0 0 0 0 0 0
## 4177 0 0 0 0 0 1 0 0 0 0
## 4188 0 0 0 0 0 0 0 0 1 0
## 4351 0 0 0 0 0 0 1 0 0 0
## 4401 0 0 0 0 0 0 0 1 0 0
## 4684 0 0 0 0 0 0 0 0 0 1
LeNetWe can use the mxnet package LeNet convolutional neural network (CNN) protocol for learning the network.
# input
data <- mx.symbol.Variable('data')
# first conv
conv1 <- mx.symbol.Convolution(data=data, kernel=c(5,5), num_filter=20)
tanh1 <- mx.symbol.Activation(data=conv1, act_type="tanh")
pool1 <- mx.symbol.Pooling(data=tanh1, pool_type="max",
kernel=c(2,2), stride=c(2,2))
# second conv
conv2 <- mx.symbol.Convolution(data=pool1, kernel=c(5,5), num_filter=50)
tanh2 <- mx.symbol.Activation(data=conv2, act_type="tanh")
pool2 <- mx.symbol.Pooling(data=tanh2, pool_type="max",
kernel=c(2,2), stride=c(2,2))
# first fullc
flatten <- mx.symbol.Flatten(data=pool2)
fc1 <- mx.symbol.FullyConnected(data=flatten, num_hidden=500)
tanh3 <- mx.symbol.Activation(data=fc1, act_type="tanh")
# second fullc
fc2 <- mx.symbol.FullyConnected(data=tanh3, num_hidden=10)
# loss
lenet <- mx.symbol.SoftmaxOutput(data=fc2)train.array <- train.x
dim(train.array) <- c(28, 28, 1, ncol(train.x))
test.array <- test
dim(test.array) <- c(28, 28, 1, ncol(test))Start by defining the devices:
n.gpu <- 1
device.cpu <- mx.cpu()
device.gpu <- lapply(0:(n.gpu-1), function(i) {
mx.gpu(i)
})Passing a list of devices is useful for high-end computational platforms (e.g., multi-GPU systems); mxnet can train on multiple GPUs or CPUs.
To train using the CPU, try fewer iterations as protocol is computationally very intense.
mx.set.seed(1234)
tic <- proc.time()
model <- mx.model.FeedForward.create(lenet, X=train.array, y=train.y,
ctx=device.cpu, num.round=1, array.batch.size=100,
learning.rate=0.05, momentum=0.9, wd=0.00001,
eval.metric=mx.metric.accuracy,
epoch.end.callback=mx.callback.log.train.metric(100))
print(proc.time() - tic)## user system elapsed
## 132.16 58.28 38.31
The corresponding training on GPU is similar, but it requires a separate GPU-compilation of mxnet (/mxnet/src/storage/storage.cc:78) with USE_CUDA=1 to enable GPU usage.
mx.set.seed(1234)
tic <- proc.time()
model <- mx.model.FeedForward.create(lenet, X=train.array, y=train.y,
ctx=device.gpu, num.round=5, array.batch.size=100,
learning.rate=0.05, momentum=0.9, wd=0.00001,
eval.metric=mx.metric.accuracy,
epoch.end.callback=mx.callback.log.train.metric(100))
print(proc.time() - tic)GPU training is faster than CPU. Everyone can submit a new classification result to Kaggle and see a ranking result for their classifier. Make sure you follow the specific result-file submission format.
preds <- predict(model, test.array)
pred.label <- max.col(t(preds)) - 1
submission <- data.frame(ImageId=1:ncol(test), Label=pred.label)
write.csv(submission, file='submission.csv', row.names=FALSE, quote=FALSE)A real-world example of deep learning is classification of 2D images (pictures) or 3D volumes (e.g., neuroimages).
This example shows the use a pre-trained Inception-BatchNorm Network to predict a class of real world image. The network architecture is described this paper. The pre-trained Inception-BatchNorm network is available here and here. This advanced model gives a state-of-the-art prediction accuracy on imaging data. We also need the imager package to load and preprocess the images in R.
# install.packages("imager")
require(mxnet)
require(imager)Download and unzip the pre-trained model to a working folder and load the model and the mean image (used for preprocessing using mx.nd.load) into R. This download can either be done manually, or automated, as shown below.
pathToZip <- tempfile()
download.file("http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/data/Inception.zip", pathToZip)
model_file <- unzip(pathToZip)
# setwd(paste(getwd(),"results", sep='/'))
model = mx.model.load(paste(getwd(),"Inception_BN", sep='/'), iteration=39)
mean.img = as.array( mx.nd.load(
paste(getwd(),"mean_224.nd", sep='/')
)
[["mean_img"]]
)
dim(mean.img)## [1] 224 224 3
# plot(mean.img)To classify a new image, select an image and load it in. Below, we show the classification of several alternative images.
library("imager")
# One should be able to load the image directly from the web (but sometimes there may be problems, in which case, we need to first download the image and then load it in R:
# im <- imager::load.image("http://wiki.socr.umich.edu/images/6/69/DataManagementFig1.png")
# download file to local working directory, use "wb" mode to avoid problems
download.file("http://wiki.socr.umich.edu/images/6/69/DataManagementFig1.png", paste(getwd(),"results/image.png", sep="/"), mode = 'wb')
# report download image path
paste(getwd(),"results/image.png", sep="/")## [1] "E:/Ivo.dir/Research/UMichigan/Proposals/2014/UMich_MINDS_DataSciInitiative_2014/2016/MIDAS_HSXXX_DataScienceAnalyticsCourse_Porposal/MOOC_Implementation/results/image.png"
img <- load.image(paste(getwd(),"results/image.png", sep="/"))
dim(img)## [1] 1875 1084 1 4
plot(img)Before feeding the image to the deep learning network for classification, we need to do some preprocessing to make it fit the deepnet input requirements. This image preprocessing (cropping and subtraction of the mean) can be done directly in R.
preproc.image <-function(im, mean.image) {
# crop the image
shape <- dim(im)
short.edge <- min(shape[1:2])
xx <- floor((shape[1] - short.edge) / 2)
yy <- floor((shape[2] - short.edge) / 2)
cropped <- crop.borders(im, xx, yy)
# resize to 224 x 224, needed by input of the model.
resized <- resize(cropped, 224, 224)
plot(resized)
# convert to array (x, y, channel)
arr <- as.array(resized[,,,c(1:3)]) * 255
plot(as.cimg(arr))
dim(arr) <- c(224, 224, 3)
# subtract the mean
normed <- arr - mean.img
# Reshape to format needed by mxnet (width, height, channel, num)
dim(normed) <- c(224, 224, 3, 1)
return(normed)
}Call the preprocessing function with the normalized image.
normed <- preproc.image(img, mean.img)# plot(normed)Use the predict function to get the probability over all (learned) classes.
prob <- predict(model, X=normed)
dim(prob)## [1] 1000 1
The prob prediction generates a \(1000 \times 1\) array representing the probability of the input image to resemble (be classified as) the top 1,000 known image categories. We can report the indices of the top-10 closest image classes to the input image:
max.idx <- order(prob[,1], decreasing = TRUE)[1:10]
max.idx## [1] 855 563 229 581 620 948 951 186 204 311
Alternatively, we can map of these top-10 indices into named image-classes.
synsets <- readLines("synset.txt")
print(paste0("Top Predicted Image-Label Classes: Name=", synsets[max.idx], "; Probability: ", prob[max.idx]))## [1] "Top Predicted Image-Label Classes: Name=n04418357 theater curtain, theatre curtain; Probability: 0.0493969731032848"
## [2] "Top Predicted Image-Label Classes: Name=n03388043 fountain; Probability: 0.0431816577911377"
## [3] "Top Predicted Image-Label Classes: Name=n02105505 komondor; Probability: 0.0371580235660076"
## [4] "Top Predicted Image-Label Classes: Name=n03457902 greenhouse, nursery, glasshouse; Probability: 0.0368416681885719"
## [5] "Top Predicted Image-Label Classes: Name=n03637318 lampshade, lamp shade; Probability: 0.0317880921065807"
## [6] "Top Predicted Image-Label Classes: Name=n07734744 mushroom; Probability: 0.029257345944643"
## [7] "Top Predicted Image-Label Classes: Name=n07747607 orange; Probability: 0.0284677185118198"
## [8] "Top Predicted Image-Label Classes: Name=n02094114 Norfolk terrier; Probability: 0.0268963072448969"
## [9] "Top Predicted Image-Label Classes: Name=n02098286 West Highland white terrier; Probability: 0.0257414449006319"
## [10] "Top Predicted Image-Label Classes: Name=n02219486 ant, emmet, pismire; Probability: 0.0205501019954681"
Clearly, this US weather pattern image is not well classified. The optimal perdition suggests this may be a theater curtain, however, the confidence is very low, \(Prob\sim 0.049\). None of the other top-10 classes capture the type of the actual image either.
The machine learning image classifications results won’t always be this poor. Let’s try classifying several alternative images.
Let’s try the automated image classification of this lakeside panorama.
download.file("https://upload.wikimedia.org/wikipedia/commons/2/23/Lake_mapourika_NZ.jpeg", paste(getwd(),"results/image.png", sep="/"), mode = 'wb')
im <- load.image(paste(getwd(),"results/image.png", sep="/"))
plot(im)normed <- preproc.image(im, mean.img)## Warning in as.cimg.array(arr): Assuming third dimension corresponds to
## colour
prob <- predict(model, X=normed)
max.idx <- order(prob[,1], decreasing = TRUE)[1:10]
print(paste0("Top Predicted Image-Label Classes: Name=", synsets[max.idx], "; Probability: ", prob[max.idx]))## [1] "Top Predicted Image-Label Classes: Name=n02894605 breakwater, groin, groyne, mole, bulwark, seawall, jetty; Probability: 0.648900210857391"
## [2] "Top Predicted Image-Label Classes: Name=n03216828 dock, dockage, docking facility; Probability: 0.18300749361515"
## [3] "Top Predicted Image-Label Classes: Name=n09332890 lakeside, lakeshore; Probability: 0.127718284726143"
## [4] "Top Predicted Image-Label Classes: Name=n03160309 dam, dike, dyke; Probability: 0.0115785356611013"
## [5] "Top Predicted Image-Label Classes: Name=n03095699 container ship, containership, container vessel; Probability: 0.00913789495825768"
## [6] "Top Predicted Image-Label Classes: Name=n09428293 seashore, coast, seacoast, sea-coast; Probability: 0.00438630860298872"
## [7] "Top Predicted Image-Label Classes: Name=n03933933 pier; Probability: 0.00410781195387244"
## [8] "Top Predicted Image-Label Classes: Name=n02859443 boathouse; Probability: 0.00246215332299471"
## [9] "Top Predicted Image-Label Classes: Name=n09399592 promontory, headland, head, foreland; Probability: 0.00168424530420452"
## [10] "Top Predicted Image-Label Classes: Name=n09421951 sandbar, sand bar; Probability: 0.0010681493440643"
This photo does represent a lakeside, which is reflected by the top three class labels:
Another costal boundary between water and land is represented in this beach image.
download.file("https://upload.wikimedia.org/wikipedia/commons/9/90/Holloways_beach_1920x1080.jpg", paste(getwd(),"results/image.png", sep="/"), mode = 'wb')
im <- load.image(paste(getwd(),"results/image.png", sep="/"))
plot(im)normed <- preproc.image(im, mean.img)prob <- predict(model, X=normed)
max.idx <- order(prob[,1], decreasing = TRUE)[1:10]
print(paste0("Top Predicted Image-Label Classes: Name=", synsets[max.idx], "; Probability: ", prob[max.idx]))## [1] "Top Predicted Image-Label Classes: Name=n09421951 sandbar, sand bar; Probability: 0.690395057201385"
## [2] "Top Predicted Image-Label Classes: Name=n09332890 lakeside, lakeshore; Probability: 0.202825039625168"
## [3] "Top Predicted Image-Label Classes: Name=n09428293 seashore, coast, seacoast, sea-coast; Probability: 0.0899281576275826"
## [4] "Top Predicted Image-Label Classes: Name=n02894605 breakwater, groin, groyne, mole, bulwark, seawall, jetty; Probability: 0.00669284025207162"
## [5] "Top Predicted Image-Label Classes: Name=n09399592 promontory, headland, head, foreland; Probability: 0.00204332964494824"
## [6] "Top Predicted Image-Label Classes: Name=n02859443 boathouse; Probability: 0.00106108351610601"
## [7] "Top Predicted Image-Label Classes: Name=n02951358 canoe; Probability: 0.000664844235870987"
## [8] "Top Predicted Image-Label Classes: Name=n09246464 cliff, drop, drop-off; Probability: 0.00041632293141447"
## [9] "Top Predicted Image-Label Classes: Name=n04357314 sunscreen, sunblock, sun blocker; Probability: 0.000338666723109782"
## [10] "Top Predicted Image-Label Classes: Name=n04606251 wreck; Probability: 0.00029250411898829"
This photo was classified appropriately and with high-confidence as:
Here is another natural image representing the Mount St. Helens Vocano.
download.file("https://upload.wikimedia.org/wikipedia/commons/thumb/d/dc/MSH82_st_helens_plume_from_harrys_ridge_05-19-82.jpg/1200px-MSH82_st_helens_plume_from_harrys_ridge_05-19-82.jpg", paste(getwd(),"results/image.png", sep="/"), mode = 'wb')
im <- load.image(paste(getwd(),"results/image.png", sep="/"))
plot(im)normed <- preproc.image(im, mean.img)prob <- predict(model, X=normed)
max.idx <- order(prob[,1], decreasing = TRUE)[1:10]
print(paste0("Top Predicted Image-Label Classes: Name=", synsets[max.idx], "; Probability: ", prob[max.idx]))## [1] "Top Predicted Image-Label Classes: Name=n09472597 volcano; Probability: 0.993182718753815"
## [2] "Top Predicted Image-Label Classes: Name=n09288635 geyser; Probability: 0.00681292032822967"
## [3] "Top Predicted Image-Label Classes: Name=n09193705 alp; Probability: 4.15805243392242e-06"
## [4] "Top Predicted Image-Label Classes: Name=n03344393 fireboat; Probability: 1.48332972571552e-07"
## [5] "Top Predicted Image-Label Classes: Name=n04310018 steam locomotive; Probability: 1.17537535260226e-08"
## [6] "Top Predicted Image-Label Classes: Name=n03388043 fountain; Probability: 7.44438377608958e-09"
## [7] "Top Predicted Image-Label Classes: Name=n04228054 ski; Probability: 2.90055357510255e-09"
## [8] "Top Predicted Image-Label Classes: Name=n02950826 cannon; Probability: 2.27150032117152e-09"
## [9] "Top Predicted Image-Label Classes: Name=n03773504 missile; Probability: 1.69992575571598e-09"
## [10] "Top Predicted Image-Label Classes: Name=n04613696 yurt; Probability: 1.25635724046447e-09"
The predicted top class labels for this image are perfect:
The next image represents a 2D snapshot of 3D shape reconstruction of a brain cortical surface. This image is particularly difficult to automatically classify because (1) few people have ever seen a real brain, (2) the mathematical and computational models used to obtain the 2D manifold representing the brain surface do vary, and (3) the patterns of sulcal folds and gyral crests are quite inconsistent between people.
download.file("http://wiki.socr.umich.edu/images/e/ea/BrainCortex2.png", paste(getwd(),"results/image.png", sep="/"), mode = 'wb')
im <- load.image(paste(getwd(),"results/image.png", sep="/"))
plot(im)normed <- preproc.image(im, mean.img)prob <- predict(model, X=normed)
max.idx <- order(prob[,1], decreasing = TRUE)[1:10]
print(paste0("Top Predicted Image-Label Classes: Name=", synsets[max.idx], "; Probability: ", prob[max.idx]))## [1] "Top Predicted Image-Label Classes: Name=n01917289 brain coral; Probability: 0.497430354356766"
## [2] "Top Predicted Image-Label Classes: Name=n07734744 mushroom; Probability: 0.229991808533669"
## [3] "Top Predicted Image-Label Classes: Name=n13052670 hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa; Probability: 0.0925178006291389"
## [4] "Top Predicted Image-Label Classes: Name=n03598930 jigsaw puzzle; Probability: 0.0433992072939873"
## [5] "Top Predicted Image-Label Classes: Name=n07718747 artichoke, globe artichoke; Probability: 0.0150046302005649"
## [6] "Top Predicted Image-Label Classes: Name=n07860988 dough; Probability: 0.0124379647895694"
## [7] "Top Predicted Image-Label Classes: Name=n07715103 cauliflower; Probability: 0.0115451822057366"
## [8] "Top Predicted Image-Label Classes: Name=n12985857 coral fungus; Probability: 0.010999240912497"
## [9] "Top Predicted Image-Label Classes: Name=n07714990 broccoli; Probability: 0.00909160356968641"
## [10] "Top Predicted Image-Label Classes: Name=n03637318 lampshade, lamp shade; Probability: 0.0075435321778059"
The top class labels for the brain image are:
Imagine if we can train a brain image classifier that labels individuals (volunteers or patients) solely based on their brain scans into different classes reflecting their development, clinical phenotypes, disease states, or aging profiles. This will require a substantial amount of expert-labeled brain scans, model training and extensive validation. However any progress in this direction will lead to effective computational clinical decision support systems that can assist physicians with diagnosis, tracking, and prognostication of brain growth and aging in health and disease.
The last example is a synthetic computer-generated image representing a cartoon face or a mask.
download.file("http://wiki.socr.umich.edu/images/f/fb/FaceMask1.png", paste(getwd(),"results/image.png", sep="/"), mode = 'wb')
im <- load.image(paste(getwd(),"results/image.png", sep="/"))
plot(im)normed <- preproc.image(im, mean.img)prob <- predict(model, X=normed)
max.idx <- order(prob[,1], decreasing = TRUE)[1:10]
print(paste0("Top Predicted Image-Label Classes: Name=", synsets[max.idx], "; Probability: ", prob[max.idx]))## [1] "Top Predicted Image-Label Classes: Name=n03724870 mask; Probability: 0.376200884580612"
## [2] "Top Predicted Image-Label Classes: Name=n04229816 ski mask; Probability: 0.253164976835251"
## [3] "Top Predicted Image-Label Classes: Name=n02708093 analog clock; Probability: 0.0562066771090031"
## [4] "Top Predicted Image-Label Classes: Name=n02865351 bolo tie, bolo, bola tie, bola; Probability: 0.029578497633338"
## [5] "Top Predicted Image-Label Classes: Name=n04192698 shield, buckler; Probability: 0.027849780395627"
## [6] "Top Predicted Image-Label Classes: Name=n03590841 jack-o'-lantern; Probability: 0.0175030268728733"
## [7] "Top Predicted Image-Label Classes: Name=n02974003 car wheel; Probability: 0.0172393918037415"
## [8] "Top Predicted Image-Label Classes: Name=n07892512 red wine; Probability: 0.0168519951403141"
## [9] "Top Predicted Image-Label Classes: Name=n03249569 drum, membranophone, tympan; Probability: 0.0141900591552258"
## [10] "Top Predicted Image-Label Classes: Name=n04447861 toilet seat; Probability: 0.0136017119511962"
The top class labels for the face mask are:
You can easily test the same image classifier on your own images and identify classes of pictures that are either well or porly classified by the deep learning based machine learning model.